Materials and methods

Materials


We work with a breast cancer dataset obtained from the Kaggle website.

A sample of the dataset is shown below:

patient_id    gender  education  treatment_data  id_healthcenter  id_treatment_region  hereditary_history  birth_date  age  weight
111036008041  0       4          2019            1.11e+09         1.11e+09             1                   1989        30   69
111035996130  0       6          2019            1.11e+09         1.11e+09             0                   1989        30   71
111035971333  0       5          2019            1.11e+09         1.11e+09             0                   1989        30   74
111036018485  0       5          2019            1.11e+09         1.11e+09             1                   1989        30   75
111035985474  0       1          2019            1.11e+09         1.11e+09             0                   2009        10   70
111035903616  0       3          2019            1.11e+09         1.11e+09             1                   1989        30   79

Cleaning the data


BEFORE                                AFTER
COLUMNS HAVE INCONSISTENT TYPES       EACH COLUMN HAS THE CORRECT TYPE
0, 1, 2 VALUES                        BOOLEAN VARIABLES
NAMES CONTAINING \r\n                 CLEAN NAMES
BIRTH DATE WITH 3 CHARACTERS          BIRTH DATE WITH 4 CHARACTERS
BLOOD TYPE 44                         VALID BLOOD TYPES ONLY
IMPLAUSIBLE WEIGHT/AGE COMBINATIONS   PEOPLE UNDER 20 YEARS OLD AND 35 KG REMOVED
WOMEN AND MEN                         ONLY WOMEN
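
A few of the cleaning rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code actually used: the function name `clean_rows`, the gender coding (`female_code=0`), and the reading of the age/weight rule as "drop anyone younger than 20 or lighter than 35 kg" are all hypothetical. The two sample rows reuse values from the dataset preview, with a stray `\r\n` added to one column name to show the name cleanup.

```python
def clean_rows(rows, female_code=0, min_age=20, min_weight_kg=35):
    """Return cleaned copies of the rows that pass the filters.

    female_code, min_age and min_weight_kg are assumptions made for
    this sketch; the real coding of the dataset may differ.
    """
    cleaned = []
    for row in rows:
        # Remove stray carriage returns / newlines from column names.
        row = {
            key.replace("\r", "").replace("\n", "").strip(): value
            for key, value in row.items()
        }
        # Keep only women (assumed to be coded as female_code).
        if int(row["gender"]) != female_code:
            continue
        age = int(row["age"])
        weight = float(row["weight"])
        # Drop implausible age/weight records.
        if age < min_age or weight < min_weight_kg:
            continue
        # Give each column a proper type; 0/1 flags become booleans.
        row["age"] = age
        row["weight"] = weight
        row["hereditary_history"] = bool(int(row["hereditary_history"]))
        cleaned.append(row)
    return cleaned


sample_rows = [
    {"patient_id": "111036008041", "gender": "0",
     "hereditary_history": "1", "age": "30", "weight\r\n": "69"},
    {"patient_id": "111035985474", "gender": "0",
     "hereditary_history": "0", "age": "10", "weight\r\n": "70"},
]
cleaned = clean_rows(sample_rows)
print(len(cleaned), cleaned[0]["weight"], cleaned[0]["hereditary_history"])
```

The second sample row (age 10) is filtered out, which mirrors the rule that removes patients under 20.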

Augmenting the data


  • We have added more informative columns
  • We have changed the type of the columns

Statistical analysis


We created several plots to explore the data and carried out statistical analyses, including a multiple correspondence analysis (MCA). The plots are shown in the Results section.
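
MCA works on an indicator matrix, that is, a table with one 0/1 column per category of each categorical variable. The sketch below builds only that matrix and does not perform the decomposition itself; the helper name `indicator_matrix` is hypothetical, and the records reuse the `education` and `hereditary_history` values from the dataset preview.

```python
def indicator_matrix(records, variables):
    """Build the 0/1 indicator matrix that MCA takes as input.

    Returns the column labels as (variable, category) pairs and one
    row of 0/1 values per record.
    """
    # One column per (variable, category) pair seen in the data.
    columns = sorted({(var, rec[var]) for rec in records for var in variables})
    matrix = [
        [1 if rec[var] == category else 0 for (var, category) in columns]
        for rec in records
    ]
    return columns, matrix


records = [
    {"education": 4, "hereditary_history": 1},
    {"education": 6, "hereditary_history": 0},
    {"education": 5, "hereditary_history": 0},
    {"education": 5, "hereditary_history": 1},
    {"education": 1, "hereditary_history": 0},
    {"education": 3, "hereditary_history": 1},
]
columns, matrix = indicator_matrix(records, ["education", "hereditary_history"])
for label in columns:
    print(label)
for row in matrix:
    print(row)
```

Each row contains exactly one 1 per variable, so every row sums to the number of variables.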

Results

Bar charts of categorical variables


  • Variables that affect health (medication, harmful habits) have a high incidence among breast cancer patients.
  • Early menstrual periods (before age 12) and late menopause (after age 55) expose women to hormones for longer, raising their risk of breast cancer.

Boxplots stratified on conditions


Boxplots of numerical variables


Histograms stratified on condition


Histograms stratified on condition


  • In most cases, mortality is higher among patients who took medicine, which is counterintuitive.
  • Not drinking alcohol or smoking improves recovery.
  • Mortality is lower among patients who drink alcohol and smoke, which is also counterintuitive.
  • These are absolute values; we should probably calculate relative values as well.
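
The difference between absolute and relative values can be illustrated with a small sketch. The records below are made up for illustration and are not taken from the dataset, and the helper name `outcome_rates` and the field names `smoker` and `died` are hypothetical. The point is only that a smaller group can show fewer deaths in absolute terms while having a higher death rate; whether this explains the patterns above would have to be checked on the real data.

```python
from collections import Counter, defaultdict


def outcome_rates(records, group_key, outcome_key):
    """Return, for each group, the share of records with each outcome."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_key]][rec[outcome_key]] += 1
    return {
        group: {
            outcome: n / sum(outcomes.values())
            for outcome, n in outcomes.items()
        }
        for group, outcomes in counts.items()
    }


# Toy data (invented): 2 smokers, 1 of whom died; 8 non-smokers, 2 of whom died.
toy_records = (
    [{"smoker": True, "died": True}]
    + [{"smoker": True, "died": False}]
    + [{"smoker": False, "died": True}] * 2
    + [{"smoker": False, "died": False}] * 6
)
rates = outcome_rates(toy_records, "smoker", "died")
print(rates[True][True], rates[False][True])
```

In this toy example the non-smokers have more deaths in absolute terms (2 versus 1) but a lower death rate (0.25 versus 0.5).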

Density plots stratified on condition


Histograms stratified on condition


Heatmap of numerical variables
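
Assuming the heatmap displays pairwise Pearson correlations between the numerical variables (the slides do not state which statistic was plotted), the underlying matrix could be computed as in the sketch below. The helper names `pearson` and `correlation_matrix` are hypothetical, and the `age` and `weight` values are the six rows from the dataset preview.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every column pair."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


numeric_columns = {
    "age": [30, 30, 30, 30, 10, 30],
    "weight": [69, 71, 74, 75, 70, 79],
}
corr = correlation_matrix(numeric_columns)
for name, row in corr.items():
    print(name, row)
```

A plotting library can then colour each cell of this matrix to produce the heatmap.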


Dynamic representation of variables


Multiple correspondence analysis (MCA)


Discussion


Conclusion


We have reached the following conclusions

Bibliography


  • Breast cancer incidence (invasive) statistics (10 March 2020). Cancer Research UK.
  • Height Percentile Calculator for Men and Women in the United States. DQYDJ.
  • Weight Gain After Breast Cancer Diagnosis May Be a Bigger Issue Than Thought in Australia (13 March 2020). BreastCancer.org.

THANKS
FOR YOUR ATTENTION